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Abstract —Collaborative filtering algorithms are important 
building blocks in many practical recommendation systems. 
For example, many large-scale data processing environments 
include collaborative filtering models for which the Alternating 
Least Squares (ALS) algorithm is used to compute latent factor 
matrix decompositions. In this paper, we propose an approach 
to accelerate the convergence of parallel ALS-based optimization 
methods for collaborative filtering using a nonlinear conjugate 
gradient (NCG) wrapper around the ALS iterations. We also 
provide a parallel implementation of the accelerated ALS-NCG 
algorithm in the Apache Spark distributed data processing 
environment, and an efficient line search technique as part of 
the ALS-NCG Implementation that requires only one pass over 
the data on distributed datasets. In serial numerical experiments 
on a linux workstation and parallel numerical experiments on a 
16 node cluster with 256 computing cores, we demonstrate that 
the combined ALS-NCG method requires many fewer iterations 
and less time than standalone ALS to reach movie rankings 
with high accuracy on the MovleLens 20M dataset. In parallel, 
ALS-NCG can achieve an acceleration factor of 4 or greater in 
clock time when an accurate solution is desired; furthermore, 
the acceleration factor Increases as greater numerical precision 
is required in the solution. In addition, the NCG acceleration 
mechanism is efficient in parallel and scales linearly with problem 
size on synthetic datasets with up to nearly 1 billion ratings. The 
acceleration mechanism is general and may also be applicable to 
other optimization methods for collaborative filtering. 

keywords-Recommendation systems, collaborative filtering, 
parallel optimization algorithms, matrix factorization, Apache 
Spark, Big Data, scalable methods. 


I. Introduction and Background 

Recommendation systems are designed to analyze available 
user data to recommend items such as movies, music, or 
other goods to consumers, and have become an increasingly 
important part of most successful online businesses. One 
strategy for building recommendation systems is known as 
collaborative filtering, whereby items are recommended to 
users by collecting preferences or taste information from 
many users (see, e.g., Q, El). Collaborative filtering methods 
provide the basis for many recommendation systems El and 
have been used by online businesses such as Amazon H, 
Netflix 0, and Spotify 0. 
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An important class of collaborative filtering methods are 
latent factor models, for which low-rank matrix factorizations 
are often used and have repeatedly demonstrated better ac¬ 
curacy than other methods such as nearest neighbor models 
and restricted Boltzmann machines Q, Q. A low-rank latent 
factor model associates with each user and with each item 
a vector of rank Uf, for which each component measures 
the tendency of the user or item towards a certain factor or 
feature. In the context of movies, a latent feature may represent 
the style or genre (i.e. drama, comedy, or romance), and the 
magnitude of a given component of a user’s feature vector is 
proportional to the user’s proclivity for that feature (or, for 
an item’s feature vector, to the degree to which that feature 
is manifested by the item). A low-rank matrix factorization 
procedure takes as its input the user-item ratings matrix R, 
in which each entry is a numerical rating of an item by a 
user and for which typically very few entries are known. The 
procedure then determines the low-rank user (U) and item 
(M) matrices, where a column in these matrices represents a 
latent feature vector for a single user or item, respectively, 
and R Ri U^M for the known values in R. Once the 
user and item matrices are computed, they are used to build 
the recommendation system and predict the unknown ratings. 
Computing user and item matrices is the first step in building 
a variety of recommendation systems, so it is important to 
compute the factorization of R quickly. 

The matrix factorization problem is closely related to the 
singular value decomposition (SVD), but the SVD of a matrix 
with missing values is undefined. However, since R « U^M, 
one way to find the user and item matrices is by minimizing 
the squared difference between the approximated and actual 
value of the known ratings in R. Minimizing this difference 
is typically done by one of two algorithms: stochastic gradient 
descent (SGD) or alternating least squares (ALS) 0, il- ALS 
can be easily parallelized and can efficiently handle models 
that incorporate implicit data (e.g. the frequency of a user’s 
mouse-clicks or time spent on a website) Eol, but it is well- 
known that ALS can require a large number of iterations to 
converge. Thus, in this paper, we propose an approach to 
significantly accelerate the convergence of the ALS algorithm 
for computing the user and item matrices. We use a nonlinear 
optimization algorithm, specifically the nonlinear conjugate 


gradient (NCG) algorithm El, as a wrapper around ALS to 
significantly accelerate the convergence of ALS, and thus refer 
to this combined algorithm as ALS-NCG. Alternatively, the 
algorithm can be viewed as a nonlinearly preconditioned NCG 
method with ALS as the nonlinear preconditioner El- Our 
approach for accelerating the ALS algorithm using the NCG 
algorithm can be situated in the context of recent research 
activity on nonlinear preconditioning for nonlinear iterative 
solvers El, m, El, El, lEl- Some of the ideas date back 
as far as the 1960s El, El, but they are not well-known and 
remain under-explored experimentally and theoretically El- 

Parallel versions of collaborative filtering and recommen¬ 
dation are of great interest in the era of big data El, El, 
El- For example, the Apache Spark data processing environ¬ 
ment II 22 I contains a parallel implementation of ALS for the 
collaborative filtering model of El, 11- In El an advanced 
distributed SGD method is described in Hadoop environments, 
followed by work in 1^ that considers algorithms based on 
ALS and SGD in environments that use the Message Passing 
Interface (MPI). Scalable coordinate descent approaches for 
collaborative filtering are proposed in El, also using the MPI 
framework. In addition to producing algorithmic advances, 
these papers have shown that the relative performance of the 
methods considered is often strongly influenced by the perfor¬ 
mance characteristics of the parallel computing paradigm that 
is used to implement the algorithms (e.g., Hadoop or MPI). In 
this paper we implement our proposed ALS-NCG algorithm in 
parallel using the Apache Spark framework, which is a large- 
scale distributed data processing environment that builds on 
the principles of scalability and fault tolerance that are instru¬ 
mental in the success of Hadoop and MapReduce. However, 
Spark adds crucial new capabilities in terms of in-memory 
computing and data persistence for iterative methods, and 
makes it possible to implement more elaborate algorithms with 
significantly better performance than is feasible in Hadoop. As 
such. Spark is already being used extensively for advanced 
big data analytics in the commercial setting ll24l . The parallel 
Spark implementation of ALS for the collaborative filtering 
model of El . fSl forms the starting point for applying the 
parallel acceleration methods proposed in this paper. 

Our contributions in this paper are as follows. The specific 
optimization problem is formulated in Section [III and the 
accelerated ALS-NCG algorithm is developed in Section |III] 
In Section we study the convergence enhancements of 
ALS-NCG compared to standalone ALS in small serial tests 
using subsets of the MovieLens 20M dataset 1^ . Section E 
describes our parallel implementation in Sparlo and Section 
[Vll contains results from parallel performance tests on a high- 
end computing cluster with 16 nodes and 256 cores, using 
both the full MovieLens 20M dataset and a large synthetic 
dataset sampled from the MovieLens 20M data with up to 6 
million users and 800 million ratings. We find that ALS-NCG 
converges significantly faster than ALS in both the serial and 
distributed Spark settings, demonstrating the overall speedup 
provided by our algorithmic acceleration. 
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H. Problem Description 

The acceleration approach we propose in this paper is 
applicable to a broad class of optimization methods and 
collaborative filtering models. For definiteness, we choose a 
specific latent factor model, the matrix factorization model 
from lO and Ell , and a specific optimization method, ALS 
El . Given the data we use, the model is presented in terms of 
users and movies instead of the more generic users and items 
framework. 

Let the matrix of user-movie rankings be represented by 
^ = {Ay}n„xnm where is the rating given to movie j 
by user i, Uu is the number of users, and is the number 
of items. Note that for any user i and movie j, the value 
of Tij is either a real number or is missing, and in practice 
very few values are known. For example, the MovieLens 20M 
dataset El with 138,493 users and 27,278 movies contains 
only 20 million rankings, accounting for less than 1% of the 
total possible rankings. In the low-rank factorization of R with 
Uf factors, each user i is associated with a vector S 
{i = 1,, Uu), and each movie j is associated with a vector 
m.j G K"!" (j = 1,..., rim)- The elements of rrij measure the 
degree that movie j possesses each factor or feature, and the 
elements of similarly measures the affinity of user i for 
each factor or feature. The dot product ufnij thus captures 
the interaction between user i and movie j, approximating user 
i’s rating of movie j as r^- « ufmj. Denoting U = [u^] G 
]^n/xn„ jjjg yggj. feature matrix and M = [nij] G 
as the movie feature matrix, our goal is to determine U and 
M such that R « U^M by minimizing the following squared 
loss function: 

£a(R,U,M)= ^ 

A ( X!IIW f + X!II II ^) ’ 

i 3 

where X is the index set of known in R, denotes the 
number of ratings by user i, and Umj is the number of ratings 
of movie j. The term A(X;i ||ui||^ -b ll“iiP) i^ 

a Tikhonov regularization El term commonly included in 
the loss function to prevent overfitting. The full optimization 
problem can be stated as 

min£A(R,U,M). (2) 

U.M 

HI. Accelerating ALS Convergence by NCG 
A. Alternating Least Squares Algorithm 

The optimization problem in (|2|i is not convex. However, 
if we fix one of the unknowns, either U or M, then the 
optimization problem becomes quadratic and we can solve for 
the remaining unknown as a least squares problem. Doing this 
in an alternating fashion is the central idea behind ALS. 

Consider the first step of the ALS algorithm in which M is 
fixed. We can determine the least squares solution to ([B for 
each Ui by setting all the components of the gradient of ([T]) 


related to to zero: =0 M i,k where u^i is an element 

of U. Expanding the terms in the derivative of ([T]i, we have 

2(ufmj - nj)mkj + 2XnuiUki = 0 V i,k 

jeii 

=> mkjufmj + XuuiUki = rrikjnj V i, k, 

jeii jeii 

where mkj is an element of M. In vector form, the resultant 
linear system for any is 

+ An„.I)u, = V i, 

where I is the Uf x Uf identity matrix, li is the index set of 
movies user i has rated, and Mx^ represents the sub-matrix 
of M where columns j G li are selected. Similarly, 
is a row vector that represents the zth row of R with only the 
columns in li included. The explicit solution for is then 
given by 

Uj = A,'Vj V i, (3) 

where Ai = Mx^Mj. + An„^I, and = MxiR^(*,Ii). The 
analogous solution for the columns of M is found by fixing 
U, where each rrij is given by 

irij = Aj^Vj V j, (4) 

where = UxjU^. + Arim^I, Vj = UxjR(Tj,j)- Here, 
Zj is the index set of users that have rated movie j, Uj^ 
represents the sub-matrix of U G where columns i G 

Zj are selected, and R(Zj,j) is a column vector that represents 
the jth column of R with only the rows in Zj included. 

Algorithm [T] summarizes the ALS algorithm used to solve 
the optimization problem given in (l2]i. From (O and (|4]i, we 
note that each of the columns of U and M may be com¬ 
puted independently; thus ALS may be easily implemented in 
parallel, as in ll^ . However, the ALS algorithm can require 
many iterations for convergence, and we now propose an 
acceleration method for ALS that can be applied to both the 
parallel and serial versions. 

B. Accelerated ALS Algorithm 

In this section, we develop the accelerated ALS-NCG algo¬ 
rithm to solve the collaborative filtering optimization problem 


Algorithm 1: Alternating Least Squares (ALS) 

Output: U,M 

Initialize M with random values; 

while Stopping criteria have not been satisfied do 

for i = 1,... ,nu do 

I Ui ^ A'^v*; 

end 

for j = 1,..., rim do 

I ^ 

end 


minimizing O- In practice we found that the nonlinear conju¬ 
gate gradient algorithm by itself was very slow to converge to 
a solution of (|2]i, and thus do not consider it as an alternative 
to the ALS algorithm. Instead we propose using NCG to 
accelerate ALS, or, said differently, the combined algorithm 
uses ALS as a nonlinear preconditioner for NCG ina. 

The standard NCG algorithm is a line search algorithm 
in continuous optimization that is an extension of the CG 
algorithm for linear systems. While the linear CG algorithm 
is specifically designed to minimize the convex quadratic 
function ^(x) = ix^Ax — b^x, where A G is a 

symmetric positive definite matrix, the NCG algorithm can be 
applied to general constrained optimization problems of the 
form minxgRn /(x). Here, the minimization problem is (|2]i, 
where the matrices U and M are found by defining the vector 
X G R"/xas 

= [uf uj ... uf^ mf mf ... mf^] (5) 

with function /(x) = C\ as in ([T]). 

The NCG algorithm generates a sequence of iterates x,, 
i > 1, from the initial guess Xq using the recurrence relation 


Xfc+i = Xfe + akPk- 

The parameter > 0 is the step length determined by a line 
search along search direction p^, which is generated by the 
following rule: 


Pk+i = -gk+i + Pk+iPk, Po = -go, (6) 


where fik+i is the update parameter and = V/(xfc) is the 
gradient of /(x) evaluated at x^. The update parameter fik+i 
can take on various different forms. In this paper, we use the 
variant of (3k+i developed by Polak and Ribiere lIZTl : 

^ gfc+i(Sfc+i gfc) 

Pk+i — -X-• GI 

gfc gfe 

Note that if a convex quadratic function is optimized using 
the NCG algorithm with an exact line search, then (|7]i reduces 
to the same fik+i as in the original CG algorithm for linear 
systems Km. 

The preconditioning of the NCG algorithm by the ALS 
algorithm modifies the expressions for fik+i and Pfc+i as 
follows, and is summarized in Algorithm [J] Let x^ be the 
iterate generated from one iteration of the ALS algorithm 
applied to x^, x^ = P(xfc), where P represents one iteration 
of ALS. This iterate is incorporated into the NCG algorithm 
by defining the preconditioned gradient direction generated by 
ALS as 

gfc = Xfc - Xfc = Xfc - P(Xfc), 

and replacing gfc with gfc in (|6]l. Note that —gfc is expected to 
be a descent direction that is an improvement compared to the 
steepest descent direction —gfc. The update parameter fik+i is 
redefined as fik+i with the form 


Pk+l 


Sfc-i-i (gfc-t-1 gfc) 


( 8 ) 








Algorithm 2: Accelerated ALS (ALS-NCG) 
Input: xq 
Output: Xfe 

go ^ xo - P(xo); 

Po ^—lo; 

/c ^ 0; 

while gfc 7 ^ 0 do 

Compute ctfc; 

Xfc+i -t— Xfc + ttfcPfc; 

gfe+i ^ Xfc+i - Pi^k+i)-, 

Compute 

Pfc+l -gfe+i + Pk+iPk', 

/c ■<— fc + 1; 

end 


where ® is similar to (|2l), however not every instance of g^ 
has been replaced with g^,. The specific form for /3fc_|_i is 
chosen because if Algorithm |2] were applied to the convex 
quadratic problem with an exact line search and P were 
represented by a preconditioning matrix P, then Algorithm |2] 
would be equivalent to preconditioned CG with preconditioner 
p. In ina, an overview and an in-depth analysis are given of 
the different possible forms for /3fc+i, however ([8]l performed 
best in our numerical experiments. Note that the primary 
computational cost in Algorithm|2]comes from computing both 
the ALS iteration, P(xfe), and the step length parameter 
using a line search. 

IV. Serial Performance of ALS-NCG 

The ALS and ALS-NCG algorithms were implemented in 
serial in MATLAB, and evaluated using the MovieLens 20M 
dataset 1^ . The entire MovieLens 20M dataset has 138,493 
users, 27,278 movies, and just over 20 million ratings, where 
each user has rated at least 20 movies. To investigate the 
algorithmic performance as a function of problem size, both 
the ALS and ALS-NCG algorithms were run on subsets of 
the MovieLens 20M dataset. In creating subsets, we excluded 
outlier users with either very many or very few movie ratings 
relative to the median number of ratings per user. To construct 
a subset with n„ users, the users from the full dataset were 
sorted in descending order by the values of riu. for each 
user. Denoting the index of the user with the median number 
of ratings by c, the set of users from index c — to 

c+ ([— 1) were included. Once the users were determined, 
the same process was used to select the movies, where the 
ratings per movie were computed only for the chosen users. 

All serial experiments were performed on a linux worksta¬ 
tion with a quad-core 3.16 GHz processor (Xeon X5460) and 8 
GB of RAM. For the ALS-NCG algorithm, the More-Thuente 
line search algorithm from the Poblano toolbox ll28ll was used 
to compute The line search parameters were as follows: 
10“"^ for the sufficient decrease condition tolerance, 10“^ for 
the curvature condition tolerance, and an initial step length 
of 1 and a maximum of 20 iterations. The stopping criteria 
for both ALS and ALS-NCG were the maximum number of 


iterations as well as a desired tolerance value in the gradient 
norm normalized by the number of variables, ;^||gfc|| for 
N = nf X (n„ + rim)- For both algorithms, a normalized 
gradient norm of less than 10“® was required within at most 
10^ iterations. In addition, ALS-NCG had a maximum number 
of allowed function evaluations in the line search equal to 10^. 

The serial tests were performed on ratings matrices of 4 
different sizes: x rim = 400 x 80, 800 x 160, 1600 x 320 

and 3200 x 640, where n„ is the number of users and rim is 
the number of movies. For each ratings matrix, ALS and ALS- 
NCG were used to solve the optimization problem in (|2]i with 
A = 0.1 and n/ = 10. The algorithms were each run using 
20 different random starting iterates (which were the same for 
both algorithms) until one of the stopping criteria was reached. 
Table U summarizes the timing results for different problem 
sizes, where the given times are written in the form a ±b 
where a is the mean time in seconds and b is the standard 
deviation about the mean. Since computing the gradient is not 
explicitly required in the ALS algorithm, the computation time 
for the gradient norm was excluded from the timing results 
for the ALS algorithm. Runs that did not converge based on 
the gradient norm tolerance were not included in the mean 
and standard deviation calculations, however the only run that 
did not converge to ■^||gfc|| < 10“® before reaching the 
maximum number of iterations was a single 1600 x 320 ALS 
run. The large standard deviations in the timing measurements 
stem from the variation in the number of iterations required 
to reduce the gradient norm. The fourth column of Table U 
shows the acceleration factor of ALS-NCG, computed as the 
mean time for convergence of ALS divided by the mean time 
for convergence of ALS-NCG. We see from this table that 
ALS-NCG significantly accelerates the ALS algorithm for all 
problem sizes. Similarly, Table HI] summarizes the number of 
iterations required to reach convergence for each algorithm. 
Again, the results were calculated based on converged runs 
only. 

From Tables HI and UIl it is clear that ALS-NCG accelerates 
the convergence of the ALS algorithm, using the gradient norm 
as the measure of convergence. However, since the factor 


TABLE I 

Timing results of ALS and ALS-NCG. 


Problem Size 

Tlu X TLui 

Time (s) 

Acceleration 

Factor 

ALS 

ALS-NCG 

400 X 80 

56.50 ± 38.06 

12.22 ± 4.25 

4.62 

800 X 160 

162.0 ± 89.94 

47.57 ± 20.02 

3.41 

1600 X 320 

30.61 ± 120.3 

116.2 ± 31.56 

2.84 

3200 X 640 

960.8 ± 364.0 

303.7 ±111.3 

3.16 


TABLE II 

Iteration results of ALS and ALS-NCG. 


Problem Size 

Tlu X TljYi 

Number of Iterations 

Acceleration 

Factor 

ALS 

ALS-NCG 

400 X 80 

2181 ± 1466 

158.8 ± 54.3 

12.74 

800 X 160 

3048 ± 1689 

290.4 ± 128.1 

10.50 

1600 X 320 

3014 ± 1098 

302.9 ± 86.3 

9.95 

3200 X 640 

4231 ± 1602 

329.6 ± 127.5 

12.84 






















matrices U and M are used to make recommendations, we 
would also like to examine the convergence of the algorithms 
in terms of the accuracy of the resultant recommendations. In 
particular, we are interested in the rankings of the top t movies 
(e.g. top 20 movies) for each user, and want to explore how 
these rankings change with increasing number of iterations for 
both ALS and ALS-NCG. If the rankings of the top t movies 
for each user no longer change, then the algorithm has likely 
computed an accurate solution. 

To measure the relative difference in rankings we use a 
metric that is based on the number of pairwise swaps required 
to convert a vector of movie rankings p 2 into another vector 
Pi, but only for the top t movies. We use a modified Kendall- 
Tau 1^ distance to compute the difference between ranking 
vectors based only on the rankings of the top t items, normal¬ 
ize the distance to range in [0,1], and subsequently average 
the distances over all users. To illustrate the ranking metric, 
consider the following example, where pi = [6,3,1, 2,4, 5], 
and p 2 = [3,4,2, 5, 6,1] are two different rankings of movies 
for user i, and let t = 2. We begin by finding the top t movies 
in pi. Here, user i ranks movie 6 highest. In p2, movie 6 
is ranked 5th, and there are 4 pairwise inversions required to 
place movie 6 in the first component of P2, producing a new 
ranking vector for the second iteration, p 2 = [6, 3,4, 2, 5,1]. 
The 2nd highest ranked movie in pi is movie 3. In p 2 , movie 
3 is already ranked 2nd; as such, no further inversions are 
required. In this case, the total number of pairwise swaps 
required to match p 2 to pi for the top 2 rankings is Si = 4. 
The distance Si is normalized to 1 if no inversions were needed 
(i.e. Pi = P 2 in the top t spots) and 0 if the maximum 
number of inversions are needed. The maximum number of 
inversions occurs if p2 and pi are in opposite order, requiring 
rim — 1 inversions in the first step, rim — 2 inversions in the 
second step, and so on until rim — t inversions are needed 
in the f-th step. Thus, the maximum number of inversions is 
^max — ij^m. 2)-f . . .-\-(^rim f) = ^ 

yielding the ranking accuracy metric for user i as 



5max 


The total ranking accuracy for an algorithm is taken as the 
average value of qi across all users relative to ranking of the 
top t movies for each user produced from the solution obtained 
after the algorithm converged. 

The normalized Kendall-Tau ranking metric for the top t 
rankings described above was used to evaluate the accuracy 
of ALS and ALS-NCG as a function of running time for t = 
20. Tables |III] and |IV] summarize the time needed for each 

TABLE III 

Ranking accuracy timing results for problem size 400 x 80 



Fig. 1. Average ranking accuracy versus time for problem size 400 X 80. 

algorithm to reach a specified percentage ranking accuracy for 
ratings matrices with different sizes. Both algorithms were run 
with 20 different random starting iterates, and the time to reach 
a given ranking accuracy is written in the form a±b where a is 
the mean time across all converged runs and b is the standard 
deviation about the mean. In the fourth column of Tables HII] 
and IIVI we have computed the acceleration factor of ALS- 
NCG as the ratio of the mean time for ALS to the mean time 
for ALS-NCG. To reach 70% accuracy, ALS is faster for both 
problem sizes, however for accuracies greater than 70%, ALS- 
NCG converges in significantly less time than ALS. Thus, if 
an accurate solution is desired, ALS-NCG is much faster in 
general than ALS. This is further illustrated in Fig. [T] which 
shows the average time needed for both algorithms to reach a 
given ranking accuracy for the 400 x 80 ratings matrix. Here, 
both ALS and ALS-NCG reach a ranking accuracy of 75% in 
less than a second, however it then takes ALS approximately 
100 s to increase the ranking accuracy from 75% to 100%, 
while ALS-NCG reaches the final ranking in only 14 s. 

V. Parallel Implementation of ALS-NCG in Spark 
A. Apache Spark 

Apache Spark is a fault-tolerant, in-memory cluster comput¬ 
ing framework designed to supersede MapReduce by maintain¬ 
ing program data in memory as much as possible between dis¬ 
tributed operations. The Spark environment is built upon two 
components: a data abstraction, termed a resilient distributed 
dataset (RDD) ll22l . and the task scheduler, which uses a 
delay scheduling algorithm ll^ . We describe the fundamental 
aspects of RDDs and the scheduler below. 

Resilient Distributed Datasets. RDDs are immutable, dis¬ 
tributed datasets that are evaluated lazily via their provenance 
information—that is, their functional relation to other RDDs 
or datasets in stable storage. To describe an RDD, consider 

TABLE IV 

Ranking accuracy timing results for problem size 800 x 160. 


Ranking 

Accuracy 

Time (s) 

Acceleration 

Factor 

ALS 

ALS-NCG 

70% 

0.37 ± 0.17 

0.65 ± 0.16 

0.58 

80% 

7.88 ± 7.03 

2.87 ± 1.72 

2.74 

90% 

37.08 ± 37.62 

6.92 ± 4.47 

5.36 

100% 

101.29 ± 68.04 

14.07 ± 5.25 

7.20 


Ranking 

Accuracy 

Time (s) 

Acceleration 

Factor 

ALS 

ALS-NCG 

70% 

0.76 ± 0.34 

1.43 ± 0.35 

0.53 

80% 

19.83 ± 13.67 

12.61 ± 10.13 

1.57 

90% 

103.91 ± 70.97 

33.25 ± 22.34 

3.12 

100% 

310.98 ± 168.67 

59.88 ± 28.20 

5.19 






















an immutable distributed dataset D of k records with homo¬ 
geneous type: D = IJ^ di with di G V. The distribution of 
D across a computer network of nodes {fa}, such that di 
is stored in memory or on disk on node Va, is termed its 
partitioning according to a partition function P{di) = Va-If D 
is expressible as a finite sequence of deterministic operations 
on other datasets Di,... ,Di that are either RDDs or persistent 
records, then its lineage may be written as a directed acyclic 
graph C formed with the parent datasets {Di} as the vertices, 
and the operations along the edges. Thus, an RDD of type V 
(written RDD [P]) is the tuple {D,P,jC). 

Physically computing the records {di} of an RDD is 
termed its materialization, and is managed by the Spark 
scheduler program. To allocate computational tasks to the 
compute nodes, the scheduler traverses an RDD’s lineage 
graph C and divides the required operations into stages of 
local computations on parent RDD partitions. Suppose that 
Ro = (Uj Xi,Po, Co) were an RDD of numeric type RDD [M], 
and let i?i = {[j^yi, Pi, Ci) be the RDD resulting from the 
application of function / : R —K. to each record of Rq. To 
compute {yi}, Ri has only a single parent in the graph Ci, 
and hence the set of tasks to perform is {f{xi)}. This type 
of operation is termed a map operation. If Pi = Pq, Ci is 
said to have a narrow dependency on Ro'. each yi may be 
computed locally from Xi, and the scheduler would allocate 
the task f{xi) to a node that stores Xi. 

Stages consist only of local map operations, and are 
bounded by shuffle operations that require communication 
and data transfer between the compute nodes. For example, 
shuffling is necessary to perform reduce operations on RDDs, 
wherein a scalar value is produced from an associative binary 
operator applied to each element of the dataset. In implemen¬ 
tation, a shuffle is performed by writing the results of the tasks 
in the preceding stage, {f{xi)}, to a local file buffer. These 
shuffle files may or may not be written to disk, depending on 
the operating system’s page table, and are fetched by remote 
nodes as needed in the subsequent stage. 

Delay Scheduling and Fault Tolerance. The simple de¬ 
lay scheduling algorithm ll^ prioritizes data locality when 
submitting tasks to the available compute nodes. If Va stores 
the needed parent partition Xi to compute task f{xi), but 
is temporarily unavailable due to faults or stochastic delays, 
rather than submitting the task on another node, the scheduler 
will wait until Va is free. However, if Va does not become 
available within a specified maximum delay time (several 
seconds in practice), the scheduler will resubmit the tasks to 
a different compute node. However, as Xi is not available in 
memory on the different node, the lineage Co of the RDD Rq 
must be traversed further, and the tasks required to compute Xi 
from the parent RDDs of Rq will be submitted for computation 
in addition to f{xi). Thus, fault tolerance is achieved in the 
system through recomputation. 

B. ALS Implementation in Spark 

The Apache Spark codebase contains a parallel implemen¬ 
tation of ALS for the collaborative Altering model of ||23, 
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Fig. 2. Hash partitioning of the columns of M, depicted as rectangles, into 
rife blocks Ml,..., Mn{,. Each block Mj^ and its index forms a partition 
of the M RDD of type RDD [{jb, ^jb)] ■ 

0 ; see ll24l . We briefly outline the main execution and 
relevant optimizations of the existing ALS implementation. 
The implementation stores the factor matrices U and M as 
single precision column block matrices, with each block as an 
RDD partition. A U column block stores factor vectors for a 
subset of users, and an M column block stores factor vectors 
for a subset of movies. The ratings matrix is stored twice: 
both R and are stored in separate RDDs, partitioned in 
row blocks (i.e., R is partitioned by users and R^ by movies). 
Row blocks of R and R^ are stored in a compressed sparse 
matrix format. The U and R RDDs have the same partitioning, 
such that a node that stores a partition of U also stores 
the partition of R such that the ratings for each user in its 
partitions of U are locally available. When updating a U block 
according to Q, the required ratings are available in a local R 
block, but the movie factor vectors in M corresponding to the 
movies rated by the users in a local U block must be shuffled 
across the network. These movie factors are fetched from 
different nodes, and, as explained below, an optimized routing 
table strategy is used from ll24ll that avoids sending duplicate 
information. Similarly, updating a block of M according to 
(IHi uses ratings data stored in a local R^ block, but requires 
shuffling of U factor vectors using a second routing table. 

Block Partitioning. All RDDs are partitioned into Ub 
partitions, where Ub is an integer multiple of the number 
of available compute cores in practice. For example, M is 
divided into column blocks with block (movie) index 
jb G { 0 ,... ,nb — 1} by hash partitioning the movie factor 
vectors such that G Rj^ if j = jb (mod Ub) as in 
Fig. |2] Similarly, U is hash partitioned into column blocks 
U 4 with block (user) index ib G {0 ,... ,nb — 1}. The RDDs 
for M and U can be taken as type RDD [(jb, M^-J] and 
RDD [(jb,U^)], where the blocks are tracked by the indices 
jb and ib. R is partitioned by rows (users) into blocks with 
type RDD [(if,, Rife)] with the same partitioning as the RDD 
representing U (and similarly for the R^ and M RDDs). By 
sharing the same user-based partitioning scheme, the blocks 
Rf,^ and U 4 are normally located on the same compute node, 
except when faults occur. The same applies to R^ and 
due to the movie-based partitioning scheme. 

Routing Table. Fig. [3 shows how a routing table optimizes 
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C. ALS-NCG Implementation in Spark 



Fig. 3. Schematic of the use of the routing table Tm(mj) in the Spark 
ALS shuffle. In (a) the blocks are filtered using foi each 

destination and shuffled to the respective blocks in (b), where arrows 
between the shaded backgrounds represent network data transfer between 
different compute nodes. In (c), when updating block the ratings 

information is locally available in Rij,. 


data shuffling in ALS. Suppose we want to update the user 
factor block according to Q. The required ratings data, 
is stored locally, but a global shuffle is required to obtain 
all movie factor vectors in M that correspond to the movies 
rated by the users in 11^^. To optimize the data transfer, a 
routing table Tm(mj) is constructed by determining, for each 
of the movie factor blocks which factor vectors have 

to be sent to each partition of (that may reside on other 
nodes). In Fig.|3](a), the blocks sae filtered using Tmittiij) 
such that a given mj € is written to the buffer destined 
for R,;^, only once regardless of how many in Ui^ 

have ratings for movie j, and only if there is at least one 
in Ui^ that has rated movie j. This is shown by the hatching 
of each rrij vector in Fig. |3](a); for instance, the first column 
in Ml is written only to and has one set of hatching 

lines, but the last column is written to both and 
and correspondingly has two sets of hatching lines. Once the 
buffers are constructed, they are shuffled to the partitions of R, 
as in Fig. |3] (b) such that both the movie factors and ratings 
are locally available to compute the new block, as in 
Fig. [3(c). The routing table formulation for shuffling U, with 
mapping Tu{ui), is analogous. Note that the routing tables are 
constructed before the while loop in Algorithm [T1 and hence 
do not need recomputation in each iteration. 

Eviiliitition of 11 ^ tind txij vis snd (|4j. As in Fig. |3 
(c), a compute node that stores R^^, will obtain ni, buffered 
arrays of Altered movie factors. Once the factors have been 
shuffled, Aj is computed for each as +^^1, 

and Vj as using the Basic Linear Algebra 

Subprograms (BLAS) library OTIl . The resulting linear system 
for Ui is then solved via the Cholesky decomposition using 
LAPACK routines 1^ . giving the computation to update U 
an asymptotic complexity of 0 {nu'n?f) since n„ linear systems 
must be solved. Solving for mj is an identical operation with 
the appropriate routing table and has 0 {nmnj:) complexity. 


We now discuss our contributions in parallelizing ALS- 
NCG for Spark. Since the calculation of ak in Algorithm 
[2] requires a line search, the main technical challenges we 
address are how to compute the loss function in (|TJ and its 
gradient in an efficient way in Spark, obtaining good parallel 
performance. To this end, we formulate a backtracking line 
search procedure that dramatically reduces the cost of multiple 
function evaluations, and we take advantage of the routing 
table mechanism to obtain fast communication. We also extend 
the ALS implementation in Spark to support the additional 
NCG vector operations required in Algorithm [2] using BLAS 
routines. 

Vector Storage. The additional vectors x, g, g, and p 
were each split into two separate RDDs, such that blocks 
corresponding to the components of (see (|5]l) were stored 
in one RDD and partitioned in the same way as U with block 
index 4- Analogously, blocks corresponding to components 
of rtij were stored in another RDD, partitioned in the same 
way as M with block index j),. This ensured that all vector 
blocks were aligned component-wise for vector operations; 
furthermore, the p blocks could also be shuffled efficiently 
using the routing tables in the line search (see below). 

Vector Operations. RDDs have a standard operation termed 
a join, in which two RDDs i?i and i ?2 representing the datasets 
of tuples ljj(fci,ai) and \Jfiki,bi) with type {1C, A) and 
(/C, B), respectively, are combined to produce an RDD i ?3 of 
type RDD [(/C, (.4, i?))], where R 3 = i?i.join(i? 2 ) represents 
the dataset {Jfiki, {ai,bi)) of combined tuples and ki is a 
key common to both Ri and i? 2 - Parallel vector operations 
between RDD representations of blocked vectors x and y 
were implemented by joining the RDDs by their block index 
if, and calling BLAS level 1 interfaces on the vectors within 
the resultant tuples of aligned vector blocks, {(x^jy^)}. 
Since the RDD implementations of vectors had the same 
partitioning schemes, this operation was local to a compute 
node, and hence very inexpensive. One caveat, however, is that 
BLAS subprograms generally perform modifications to vectors 
in place, overwriting their previous components. For fault 
tolerance, RDDs must be immutable; as such, whenever BLAS 
operations were required on the records of an RDD, an entirely 
new vector was allocated in memory and the RDD’s contents 
copied to it. Algorithm [3] shows the operations required for the 
vector addition ax + by using the BLAS . axpby routine (note 
that the result overwrites y in place). The inner product of two 
block vector RDDs and norm of a single block vector RDD 
were implemented in a similar manner to vector addition, with 
an additional reduce operation to sum the scalar component¬ 
wise dot products. These vector operations were used to 
compute fik+i in (HI). 

Line Search & Loss Function Evaluation. We present 
a computationally cheap way to implement a backtracking 
line search in Spark for minimizing dD- A backtracking line 
search minimizes a function /(x) along a descent direction 
p by decreasing the step size in each iteration by a factor of 












































































Algorithm 3: RDD Block Vector BLAS . axpby 

Input: X = RDD [(ib.XiJ]; y = RDD [(i6,yij]; a, fo G R 
Output; z — ax. + by 
z -t- x.join(y).map{ 

Allocate Zi^; 

^ib yifc i 

Call BLAS . axpby(a,Xi^, &, Zi^); 

Yield 

} 


T G (0,1), and terminates once a sufficient decrease in /(x) is 
achieved. This simple procedure is summarized in Algorithm 
m where it is important to note that a line search requires 
computing g once, as well as /(x) in each iteration of the line 
search. To avoid multiple shuffles in each line search iteration, 
instead of performing multiple evaluations of O directly, we 
constructed a polynomial with degree 4 in the step size a by 
expanding ([I]i with = x„, +ap„, and xa.j = x^^ +apmj, 
where x„. refers to the components of x related to user i and 
mutatis mutandis for the vectors p„^, x^^., and Pmj- From 
the bilinearity of the inner product, this polynomial has the 
form Q{a) = ELo (E(i,i)ei C'l"') where the terms 

in the summation for each coefficient only require level 
1 BLAS operations between the block vectors x„., p„., x^^., 
and Pm- for known {i,j) G I. Thus, coefficients of Q{a) 
were computed at the beginning of the line search with 
a single shuffle operation using the routing table Tmimj) 
to match vector pairs with dot products contributing to the 
coefficients. Here, Tm(mj) was chosen since there is far less 
communication required to shuffle {nij}, as Uu 3> Um- Since 
each iteration of the line search was very fast after computing 
the coefficients of Q(a), we used relatively large values of 
T = 0.9, c = 0.5, and ao = 10 in Algorithm |4] that searched 
intensively along direction p. 

Gradient Evaluation. We computed gk with respect to a 
block for Ui using only BLAS level 1 operations as 2Anu. + 

2 Ejgx ~ with an analogous operation with 

respect to each m^. As this computation requires matching 
the Ui and rrij factors, the routing tables T„(ui) and Tm(nij) 
were used to shuffle and rrij consecutively. Evaluating 
gk can be performed with 0{nf{nu Ei + nm Ej ^mj)) 
operations, but requires two shuffles. As such, the gradient 
computation required as much communication as a single 
iteration of ALS in which both U and M are updated. 


Algorithm 4: Backtracking Line Search 
Input: X, p, g, ao, cG (0,1), r G (0,1) 

Output: ak 

fc ^ 0; 

while /(x + akp) - /(x) > akcg^p do 
ctfc+i ■<— T ak] 
k — k -\- \] 

end 


VI. Parallel Performance of ALS-NCG 

Our comparison tests of the ALS and ALS-NCG algorithms 
in Spark were performed on a computing cluster composed 
of 16 homogeneous compute nodes, 1 storage node hosting 
a network filesystem, and 1 head node. The nodes were 
interconnected by a 10 Gb ethernet managed switch (Pow- 
erConnect 8164). Each compute node was a 64 bit rack server 
(PowerEdge R620) running Ubuntu 14.04, with linux kernel 
3.13. The compute nodes all had two 8 -core 2.60 GHz chips 
(Xeon E5-2670) and 256 GB of SDRAM. The head node had 
the same processors as the compute nodes, but had 512 GB of 
RAM. The single storage node (PowerEdge R720) contained 
two 2 GHz processors, each with 6 cores (Xeon E5-2620), 
64 GB of memory, and 12 hard disk drives of 4 TB capacity 
and 7200 RPM nominal speed. Einally, compute nodes were 
equipped with 6 ext4-formatted local SCSI 10k RPM hard disk 
drives, each with a 600 GB capacity. 

Our Apache Spark assembly was built from a snapshot of 
the 1.3 release using Oracle’s Java 7 distribution, Scala 2.10, 
and Hadoop 1.0.4. Input files to Spark programs were stored 
on the storage node in plain text. The SCSI hard drives on the 
compute nodes’ local filesystems were used as Spark spilling 
and scratch directories, and the Spark checkpoint directory 
for persisting RDDs was specified in the network filesystem 
hosted by the storage node, and accessible to all nodes in 
the cluster. Shuffle files were consolidated into larger files, as 
recommended for ext4 filesystems ll^ . In our experiments, 
the Spark master was executed on the head node, and a single 
instance of a Spark executor was created on each compute 
node. It was empirically found that the ideal number of cores 
to make available to the Spark driver was 16 per node, or the 
number of physical cores for a total of 256 available cores. The 
value of rib was set to the number of cores in all experiments. 

To compare the performance of ALS and ALS-NCG in 
Spark, the two implementations were tested on the Movie- 
Lens 20M dataset with A = 0.01 and n/ = 100. In both 
algorithms, the RDDs were checkpointed to persistent storage 
every 10 iterations, since it is a widely known issue in the 
Spark community that RDDs with very long lineage graphs 
cause stack overflow errors when the scheduler recursively 
traverses their lineage ll34ll . Since RDDs are materialized via 
lazy evaluation, to obtain timing measurements, actions were 
triggered to physically compute the partitions of each block 
vector RDD at the end of each iteration in Algorithm [T] and 
|2] Eor each experimental run of ALS and ALS-NCG, two 
experiments with the same initial user and movie factors were 
performed: in one, the gradient norm in each iteration was 
computed and printed (incurring additional operations); in the 
other experiment, no additional computations were performed 
such that the elapsed times for each iteration were correctly 
measured. 

Fig. in shows the convergence in gradient norm, normalized 
by the degrees of freedom N = rif x {riu + rim), for six 
separate runs of ALS and ALS-NCG on 8 compute nodes for 
the MovieLens 20M dataset. The subplots (a) and (b) show 











Fig. 4. Convergence in normalized gradient norm ||gfe || for 6 instances of 
ALS and ALS-NCG with different starting values in both (a) iteration and (b) 
clock time, for the MovieLens 20M dataset. The two solid lines in each panel 
show actual convergence traces for one of the instances, while the shaded 
regions show the standard deviation about the mean value over all instances, 
computed for non-overlapping windows of 3 iterations for (a), and 30 s for 
(b). The experiments were conducted on 8 nodes (128 cores). 

^Ilgfcll over 100 iterations and 25 minutes, respectively. This 
time frame was chosen since it took just over 20 minutes for 
ALS-NCG to complete 100 iterations; note that in Fig.|4](b), 
200 iterations of ALS are shown, since with this problem 
size it took approximately twice as long to run a single 
iteration of ALS-NCG. The shaded regions in Fig. |4] show the 
standard deviation about the mean value for ;^||gfc|| across all 
runs, computed for non-overlapping windows of 3 iterations 
for subplot (a) and 30 s for subplot (b). Even within the 
uncertainty bounds, ALS-NCG requires much less time and 
many fewer iterations than ALS to reach accurate values of 
;^l|gfe|| (e.g. below lO"^). 

The operations that we have implemented in ALS-NCG that 
are additional to standard ALS in Spark have computational 
complexity that is linear in problem size. To verify the 
expected linear scaling, experiments with a constant number 
of nodes and increasing problem size up to 800 million 
ratings were performed on 16 compute nodes. Synthetic ratings 
matrices were constructed by sampling the MovieLens 20M 
dataset such that the synthetic dataset had the same statistical 
sparsity, realistically simulating the data transfer patterns in 
each iteration. To do this, first the dimension n„ of the 
sampled dataset was fixed, and for each user was sampled 
from the empirical probability distribution p(n„JR) of how 
many movies each user ranked, computed empirically from the 
MovieLens 20M dataset. The movies were then sampled 
from the empirical likelihood of sampling the jth movie, 
p(mj|R), and the resultant rating value was sampled from 
the distribution of numerical values for all ratings. The choice 
of scaling up the users (for a fixed set of movies) was made 



Fig. 5. Linear scaling of computation time per iteration with increasing 
Tiu on a synthetic dataset for ALS and ALS-NCG on 16 compute nodes (256 
cores) for up to 6M users, corresponding to 800M ratings. 

to model the situation in which an industry’s user base grows 
far more rapidly than its items. 

Fig. m shows the linear scaling in computation time for both 
ALS and ALS-NCG. The values shown are average times 
per iteration over 50 iterations, for from 1 to 6 million, 
corresponding to the range from 133 to 800 million ratings. 
The error bars show the uncertainty in this measurement, 
where the standard deviation takes into account that iterations 
with and without checkpointing come from two different 
populations with different average times. The uncertainty in 
the time per iteration for ALS-NCG is larger due to the greater 
overhead of memory management and garbage collection by 
the Java Virtual Machine required with more RDDs. While 
each iteration of ALS-NCG takes longer due to the additional 
line search and gradient computations, we note that many 
fewer iterations are required to converge. 

Finally, we compute the relative speedup that was attainable 
on the large synthetic datasets. For the value of ||gfe|| in each 
iteration of ALS-NCG, we determined how many iterations of 
regular ALS were required to achieve an equal or lesser value 
gradient norm. Due to the local variation in ;^||gfc|| (as in 
Fig. |4]i, a moving average filter over every two iterations was 
applied to the ALS-NCG gradient norm values. The total time 
required for ALS and ALS-NCG to reach a given gradient 
norm was then estimated from the average times per iteration 
in Fig. |5] The ratios of these total times for ALS and ALS- 
NCG are shown in Fig. |6] as the relative speedup factor 
for the IM, 3M, and 6M users ratings matrices. When an 
accurate solution is desired, ALS-NCG often achieves faster 
convergence by a factor of 3 to 5, with the acceleration factor 
increasing with greater desired accuracy in the solution. 

VII. Conclusion 

In this paper, we have demonstrated how the addition of 
a nonlinear conjugate gradient wrapper can accelerate the 
convergence of ALS-based collaborative filtering algorithms 
when accurate solutions are desired. Furthermore, our parallel 
ALS-NCG implementation can significantly speed up big data 
recommendation in the Apache Spark distributed computing 
environment, and the proposed NCG acceleration can be nat¬ 
urally extended to Hadoop or MPI environments. It may also 
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Fig. 6. Speedup of ALS-NCG over ALS as a function of normalized gradient 
norm on 16 compute nodes (256 cores), for a synthetic problem with up to 6M 
users and 800M ratings. ALS-NCG can easily outperform ALS by a factor 
of 4 or more, especially when accurate solutions (small normalized gradients) 
are required or problem sizes are large. 


be applicable to other optimization methods for collaborative 
hltering. Though we have focused on a simple latent factor 
model, our acceleration can be used with any collaborative 
hltering model that uses ALS. We expect that our acceleration 
approach will be especially useful for advanced collaborative 
hltering models that achieve low root mean square error 
(RMSE), since these models require solving the optimization 
problem accurately, and that is precisely where accelerated 
ALS-NCG shows the most beneht over standalone ALS. 
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